Assignment

To use visual analytics to detect anomalies and suspicious behavior in movement and tracking data.

Nurulasyiqah Md. Taha https://www.linkedin.com/in/nurulasyiqah-md-taha/
07-13-2021

1. Introduction

VAST Challenge 2021 - The Kronos Incident (Mini-Challenge 2)

Between 20 and 21 Jan 2014, on the island country of Kronos, several employees of GAStech, a Tethys-based gas multinational, went missing.

To get to the bottom of this mystery, we will use visual analytic techniques to analyze the data provided by GAStech, assist law enforcement’s investigation, and hopefully find the missing persons and bring them home safely. The data provided by GAStech, covering the two weeks prior to the GAStech employees’ disappearance, are as follows:

  1. car-assignments.csv - vehicle assignments by employee
  2. cc_data.csv - credit card transactions
  3. loyalty_data.csv - loyalty card transactions
  4. gps.csv - movement tracking data from GAStech vehicles
  5. MC2-tourist.jpg - a tourist map of Abila, Kronos
  6. Abila shapefile - geospatial data for Abila (in WGS84 coordinates)

2. Objective

The objective of this assignment is to use visual analytic techniques, purely in R, to surface and identify anomalies and suspicious behavior.

3. Literature review

3.1 Previous work

Since VAST Challenge 2021 is a redux of VAST Challenge 2014 with some variation, modified data and new questions, we reviewed past submissions for Mini-Challenge 2 for some inspiration on tackling this year’s challenge.

We were particularly drawn to the following submissions:

Eager to reproduce and improve the visualisations, we looked further into the dataset provided for the 2014 Challenge. While the 2021 datasets are generally similar to those from 2014, we noted one significant difference: neither the credit card transactions data nor the loyalty card transactions data reflects the names of the card holders.

This makes it more challenging to reproduce the inspiring visualisations above (especially the Social Relationship Matrix), and we have to consider alternatives or adaptations in some cases.

3.2 Chart ideas

We browsed the R Graph Gallery for ideas on various types of charts and graphs that can be visualised using relevant R packages.

3.3 R packages

4. Data preparation

4.1 Data cleaning

Install and launch R packages

The following code chunk installs and launches the R packages we will use for data visualisation and analysis in this assignment.

packages = c('DT', 'ggiraph', 'ggplot2', 'plotly', 
             'tidyverse', 'raster', 'sf', 'clock', 'tmap',
             'mapview', 'tidygraph', 'ggraph', 'visNetwork',
             'igraph', 'lubridate', 'viridis')
              
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Import data files provided

The following code chunk imports the four comma-separated values (.csv) data files provided for this Mini-Challenge - car-assignments.csv, cc_data.csv, gps.csv and loyalty_data.csv. The “data” folder containing the four data files is saved in the same location (working directory) as the R Markdown document used to publish this webpage.

car_assign<- read_csv("data/car-assignments.csv")
credit_card<- read_csv("data/cc_data.csv")
gps <- read_csv("data/gps.csv")
loyalty_card <- read_csv("data/loyalty_data.csv")

Take a glimpse of the “credit_card” dataset with the following code:

glimpse(credit_card)
Rows: 1,490
Columns: 4
$ timestamp  <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~

We need to change the “timestamp” column from character data type to date-time data type in %m/%d/%Y %H:%M format and extract the “hour_of_day” and “day_of_week” for analysis later. We also need to change the “last4ccnum” column from numerical data type to factor data type and rename it to “cc_id”.

credit_card$date <- date_time_parse(credit_card$timestamp,
                                             zone = "",
                                             format = "%m/%d/%Y")

credit_card$timestamp <- date_time_parse(credit_card$timestamp,
                                         zone = "",
                                         format = "%m/%d/%Y %H:%M")

credit_card$hour_of_day <- get_hour(credit_card$timestamp)

credit_card$day_of_week <- wday(credit_card$timestamp, label = TRUE, abbr = FALSE)

credit_card$last4ccnum <- as_factor(credit_card$last4ccnum)

credit_card <- credit_card %>%
  rename(cc_id = last4ccnum)

The cleaned version of “credit_card” dataset is as follows:

glimpse(credit_card)
Rows: 1,490
Columns: 7
$ timestamp   <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-~
$ location    <chr> "Brew've Been Served", "Hallowed Grounds", "Brew~
$ price       <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.~
$ cc_id       <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, ~
$ date        <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06,~
$ hour_of_day <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ~
$ day_of_week <ord> Monday, Monday, Monday, Monday, Monday, Monday, ~

Next, we take a glimpse of the “loyalty_card” dataset with the following code:

glimpse(loyalty_card)
Rows: 1,392
Columns: 4
$ timestamp  <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~

We need to change the “timestamp” column from character data type to date data type in %m/%d/%Y format, rename it to “date”, and extract the “day_of_week” for analysis later.

loyalty_card <- loyalty_card %>%
  mutate(timestamp = date_time_parse(timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y")) %>%
  rename(date = timestamp)

loyalty_card$day_of_week <- wday(loyalty_card$date,
                               label = TRUE,
                               abbr = FALSE)

The cleaned version of “loyalty_card” dataset is as follows:

glimpse(loyalty_card)
Rows: 1,392
Columns: 5
$ date        <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06,~
$ location    <chr> "Brew've Been Served", "Brew've Been Served", "H~
$ price       <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15~
$ loyaltynum  <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L4~
$ day_of_week <ord> Monday, Monday, Monday, Monday, Monday, Monday, ~

Next, we take a glimpse of the “gps” dataset with the following code:

glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

We need to change the “Timestamp” column from character data type to date-time data type in %m/%d/%Y %H:%M format (the seconds in the raw data are dropped) and extract the “hour_of_day” and “day_of_week” for analysis later. We also need to change the “id” column from numerical data type to factor data type and rename it to “CarID”.

gps$date <- date_time_parse(gps$Timestamp,
                            zone = "",
                            format = "%m/%d/%Y")

gps$Timestamp <- date_time_parse(gps$Timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y %H:%M")

gps$hour_of_day <- get_hour(gps$Timestamp)

gps$day_of_week <- wday(gps$Timestamp, label = TRUE, abbr = FALSE)

gps$id <- as_factor(gps$id)

gps <- gps %>%
  rename(CarID = id)

The cleaned version of “gps” dataset is as follows:

glimpse(gps)
Rows: 685,169
Columns: 7
$ Timestamp   <dttm> 2014-01-06 06:28:00, 2014-01-06 06:28:00, 2014-~
$ CarID       <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, ~
$ lat         <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621~
$ long        <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417~
$ date        <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06,~
$ hour_of_day <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ~
$ day_of_week <ord> Monday, Monday, Monday, Monday, Monday, Monday, ~

Next, we take a glimpse of the “car_assign” dataset with the following code:

glimpse(car_assign)
Rows: 44
Columns: 5
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~

We need to change the “CarID” column from numerical data type to factor data type. We will also concatenate the “LastName” and “FirstName” of each employee into “EmployeeName”.

car_assign$CarID <- as_factor(car_assign$CarID)

car_assign <- car_assign %>%
  unite(EmployeeName, c("FirstName", "LastName"), sep = " ")

The cleaned version of “car_assign” dataset is as follows:

glimpse(car_assign)
Rows: 44
Columns: 4
$ EmployeeName           <chr> "Nils Calixto", "Lars Azada", "Felix ~
$ CarID                  <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~

4.2 Geo-referencing

As the Abila, Kronos, tourist map image file provided (MC2-tourist.jpg) does not contain coordinate reference information, we need to perform geo-referencing. We used a geographic information system (GIS) software, QGIS, to geo-reference the image file against the Abila shapefile (in WGS84 coordinates) and obtain a geo-referenced tourist map (MC2-tourist.tif).
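To confirm that the geo-referencing took effect, we can load the resulting GeoTIFF back into R and inspect its coordinate reference system and extent. A minimal sketch using the raster package (assuming the .tif was exported from QGIS into the data/Geospatial folder):

```r
# Sanity-check the geo-referenced raster produced by QGIS
library(raster)

bgmap_check <- raster("data/Geospatial/MC2-tourist.tif")
crs(bgmap_check)    # should report WGS 84 (EPSG:4326)
extent(bgmap_check) # should roughly match the Abila shapefile's bounding box
```

If the CRS is missing or the extent does not overlap the shapefile, the geo-referencing step needs to be redone in QGIS.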

5. Visual analysis
5.1 Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see?

When we first tried to run the code for the visualisation, we encountered the following error message:

invalid input ‘Katerina’s Café’ in ‘utf8towcs’

Therefore, we ran the following code on the “credit_card” and “loyalty_card” datasets to replace the accented characters.

credit_card <- credit_card %>%
  mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

loyalty_card <- loyalty_card %>%
  mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

Most popular locations based on credit card data

credit_card %>%
  mutate(location = fct_rev(fct_infreq(location))) %>%
  ggplot(aes(x = location)) +
  geom_bar(colour = "grey", fill = "dark blue") +
  xlab("Location") +
  ylab("No. of credit card transactions") +
  theme(axis.text.x = element_text(vjust = 0.5, hjust=1)) +
  coord_flip()

Based purely on the number of credit card transactions at each location over the course of two weeks, the top five popular locations were:

  1. Katerina’s Cafe
  2. Hippokampos
  3. Guy’s Gyros
  4. Brew’ve Been Served
  5. Hallowed Grounds
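The ranking above can be verified directly with a frequency count; a minimal sketch:

```r
# Tabulate credit card transactions by location, most frequent first
credit_card %>%
  count(location, sort = TRUE) %>%
  head(5)
```

The same count on “loyalty_card” reproduces the loyalty card ranking in the next section.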

Most popular locations based on loyalty card data

loyalty_card %>%
  mutate(location = fct_rev(fct_infreq(location))) %>%
  ggplot(aes(x = location)) +
  geom_bar(colour = "grey", fill = "dark green") +
  xlab("Location") +
  ylab("No. of loyalty card transactions") +
  theme(axis.text.x = element_text(vjust = 0.5, hjust=1)) +
  coord_flip()

Based purely on the number of loyalty card transactions at each location over the course of two weeks, the top five popular locations were:

  1. Katerina’s Cafe
  2. Hippokampos
  3. Guy’s Gyros
  4. Brew’ve Been Served
  5. Ouzeri Elian

Most popular hours of the day based on credit card data

cc_hour_of_day <- credit_card %>%
  count(location, hour_of_day) %>%
  mutate(location = as.factor(location),
         hour_of_day = as.factor(hour_of_day),
         text = paste0("Location: ", location, "\n", 
                       "Hour of the day: ", hour_of_day, "\n", 
                       "Count: ", n))

hMap_hour_of_day <- ggplot(cc_hour_of_day, aes(hour_of_day,
                                             location,
                                             fill = n,
                                             text = text)) +
  geom_tile() +
  scale_fill_viridis(discrete = FALSE) +
  scale_y_discrete() +
  scale_x_discrete() +
    xlab("Hour of the day") +
  theme(panel.grid.major = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 5),
        axis.title.y = element_blank())

ggplotly(hMap_hour_of_day, tooltip = "text")

Brew’ve Been Served is a highly popular morning coffee place, as seen from the high number of transactions between 7am and 9am. Dinner at Katerina’s Cafe appears to be a very popular option, based on the high number of transactions between 8pm and 9pm.

During the two-week period, Daily Dealz recorded only one transaction, at 6am, which is ironic for a place that suggests good deals on a daily basis. This piques our interest, and we hope further investigation will reveal more information.
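As a starting point for that investigation, we can pull out the Daily Dealz record directly; a minimal sketch:

```r
# Inspect the lone Daily Dealz transaction in the credit card data
credit_card %>%
  filter(location == "Daily Dealz")
```

The timestamp, price and cc_id of this single transaction can then be cross-referenced against the GPS data later.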

Most popular days of the week based on credit card data

cc_day_of_week <- credit_card %>%
  count(location, day_of_week) %>%
  mutate(location = as.factor(location),
         day_of_week = as.factor(day_of_week),
         text = paste0("Location: ", location, "\n", 
                       "Day: ", day_of_week, "\n", 
                       "Count: ", n))

hMap_cc_day <- ggplot(cc_day_of_week, aes(day_of_week,
                                             location,
                                             fill = n,
                                             text = text)) +
  geom_tile() +
  scale_fill_viridis(discrete = FALSE) +
  scale_y_discrete() +
  scale_x_discrete() +
    xlab("Day of the week") +
  theme(panel.grid.major = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 5),
        axis.text.y = element_text(size = 5),
        axis.title.y = element_blank())

ggplotly(hMap_cc_day, tooltip = "text")

Most popular days of the week based on loyalty card data

lc_day_of_week <- loyalty_card %>%
  count(location, day_of_week) %>%
  mutate(location = as.factor(location),
         day_of_week = as.factor(day_of_week),
         text = paste0("Location: ", location, "\n", 
                       "Day: ", day_of_week, "\n", 
                       "Count: ", n))

hMap_lc_day <- ggplot(lc_day_of_week, aes(day_of_week,
                                             location,
                                             fill = n,
                                             text = text)) +
  geom_tile() +
  scale_fill_viridis(discrete = FALSE) +
  scale_y_discrete() +
  scale_x_discrete() +
    xlab("Day of the week") +
  theme(panel.grid.major = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 5),
        axis.text.y = element_text(size = 5),
        axis.title.y = element_blank())

ggplotly(hMap_lc_day, tooltip = "text")

Credit card transactions and loyalty card transactions are aligned with each other throughout the week for both Katerina’s Cafe and Brew’ve Been Served.

5.2 Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Merging “car_assign” dataset with “gps” dataset

As each employee is assigned a car, we merge the two datasets using a left join so that trucks, which are not assigned to any employee, are retained.

emp_gps <- gps %>%
  left_join(car_assign, by = c("CarID" = "CarID"))

The merged data set “emp_gps” is as follows:

glimpse(emp_gps)
Rows: 685,169
Columns: 10
$ Timestamp              <dttm> 2014-01-06 06:28:00, 2014-01-06 06:2~
$ CarID                  <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 3~
$ lat                    <dbl> 36.07623, 36.07622, 36.07621, 36.0762~
$ long                   <dbl> 24.87469, 24.87460, 24.87444, 24.8742~
$ date                   <dttm> 2014-01-06, 2014-01-06, 2014-01-06, ~
$ hour_of_day            <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ day_of_week            <ord> Monday, Monday, Monday, Monday, Monda~
$ EmployeeName           <chr> "Willem Vasco-Pais", "Willem Vasco-Pa~
$ CurrentEmploymentType  <chr> "Executive", "Executive", "Executive"~
$ CurrentEmploymentTitle <chr> "Environmental Safety Advisor", "Envi~

Create a map of Abila, Kronos

Import the raster file (MC2-tourist.tif) into an object, bgmap:

bgmap <- raster("data/Geospatial/MC2-tourist.tif")

Run the following code to create a static map:

tmap_mode("plot")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255)

Import vector GIS data file “Abila”:

Abila_st <- st_read(dsn = "data/Geospatial", layer = "Abila")
Reading layer `Abila' from data source 
  `C:\syiqah\VASQ\_posts\2021-07-13-assignment\data\Geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS:  WGS 84

Convert the GPS data into geometric simple feature data:

emp_gps_sf <- st_as_sf(emp_gps,
                   coords = c("long", "lat"),
                   crs = 4326)

Create movement path from GPS points

The following code chunk joins the GPS points into movement paths using the car ID as a unique identifier.

gps_path <- emp_gps_sf %>%
  group_by(CarID) %>%
  summarise(m = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING")

gps_path
Simple feature collection with 40 features and 2 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 40 x 3
   CarID m                                                    geometry
   <fct> <dttm>                                       <LINESTRING [°]>
 1 1     2014-01-12 04:35:03 (24.88258 36.06646, 24.88259 36.06634, 2~
 2 2     2014-01-12 02:49:14 (24.86038 36.08546, 24.86038 36.08551, 2~
 3 3     2014-01-12 10:12:57 (24.85763 36.08668, 24.85743 36.08662, 2~
 4 4     2014-01-12 13:44:21 (24.87214 36.07821, 24.87206 36.07819, 2~
 5 5     2014-01-12 13:34:35 (24.8779 36.0673, 24.87758 36.06736, 24.~
 6 6     2014-01-12 02:05:27 (24.89477 36.05949, 24.89475 36.05939, 2~
 7 7     2014-01-12 14:31:43 (24.86424 36.08449, 24.86421 36.08438, 2~
 8 8     2014-01-12 13:20:01 (24.88597 36.0673, 24.88595 36.06725, 24~
 9 9     2014-01-12 16:16:56 (24.85095 36.08172, 24.85095 36.08175, 2~
10 10    2014-01-12 16:31:24 (24.86589 36.07682, 24.86587 36.07676, 2~
# ... with 30 more rows

Plotting GPS path of employee cars

We should first check for any orphan lines (a single pair of coordinates cannot form a polyline path) and remove them.

p = npts(gps_path, by_feature = TRUE)
gps_path2 <- cbind(gps_path, p)

gps_path_sort <- gps_path2 %>%
  arrange(p)
gps_path_sort
Simple feature collection with 40 features and 3 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
First 10 features:
   CarID                   m     p                       geometry
1     31 2014-01-18 04:48:52  2317 LINESTRING (24.89519 36.070...
2    105 2014-01-14 06:40:48  9732 LINESTRING (24.87957 36.048...
3    106 2014-01-14 06:09:10 12278 LINESTRING (24.87957 36.048...
4     23 2014-01-12 13:40:03 12321 LINESTRING (24.90105 36.058...
5     30 2014-01-13 13:37:09 14005 LINESTRING (24.90104 36.058...
6     20 2014-01-12 20:43:10 14184 LINESTRING (24.90107 36.065...
7     29 2014-01-12 10:14:17 14332 LINESTRING (24.89485 36.059...
8     22 2014-01-12 19:07:59 14417 LINESTRING (24.90102 36.058...
9     16 2014-01-11 23:50:02 14506 LINESTRING (24.90563 36.060...
10    25 2014-01-12 21:51:02 14540 LINESTRING (24.89482 36.059...

We observed that no car has orphan lines, since there is no feature with p = 1 in the gps_path_sort table above. We can therefore plot the paths for all Car IDs using the following code chunk:

Route by Car ID 1 (assigned to Nils Calixto)

gps_path_selected <- gps_path %>%
  filter(CarID == 1) # Replace the CarID
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1, g = 2, b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines()
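Since the same chunk is repeated for each vehicle with only the CarID changed, it can be wrapped in a small helper function; a minimal sketch (the function name plot_car_route is our own):

```r
# Helper to render the route map for any given car ID
plot_car_route <- function(car_id) {
  gps_path_selected <- gps_path %>%
    filter(CarID == car_id)
  tm_shape(bgmap) +
    tm_rgb(bgmap, r = 1, g = 2, b = 3,
           alpha = NA,
           saturation = 1,
           interpolate = TRUE,
           max.value = 255) +
    tm_shape(gps_path_selected) +
    tm_lines()
}

plot_car_route(2)  # e.g. route for Car ID 2
```

With tmap_mode("view") set, each call returns an interactive map for the selected vehicle.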

Route by Car ID 2 (assigned to Lars Azada)

Route by Car ID 3 (assigned to Felix Balas)

Route by Car ID 4 (assigned to Ingrid Barranco)

Route by Car ID 5 (assigned to Isak Baza)

Route by Car ID 6 (assigned to Linnea Bergen)

Route by Car ID 7 (assigned to Elsa Orilla)

Route by Car ID 8 (assigned to Lucas Alcazar)

Route by Car ID 9 (assigned to Gustav Cazar)

Route by Car ID 10 (assigned to Ada Campo-Corrente)

Route by Car ID 11 (assigned to Axel Calzas)

Route by Car ID 12 (assigned to Hideki Cocinaro)

Route by Car ID 13 (assigned to Inga Ferro)

Route by Car ID 14 (assigned to Lidelse Dedos)

Route by Car ID 15 (assigned to Loreto Bodrogi)

Route by Car ID 16 (assigned to )

Route by Car ID 17 (assigned to )

Route by Car ID 18 (assigned to )

Route by Car ID 19 (assigned to )

Route by Car ID 20 (assigned to )

Route by Car ID 21 (assigned to )

Route by Car ID 22 (assigned to )

Route by Car ID 23 (assigned to )

Route by Car ID 24 (assigned to )

Route by Car ID 25 (assigned to )

Route by Car ID 26 (assigned to )

Route by Car ID 27 (assigned to )

Route by Car ID 28 (assigned to )

Route by Car ID 29 (assigned to )

Route by Car ID 30 (assigned to )

Route by Car ID 31 (assigned to )

Route by Car ID 32 (assigned to )

Route by Car ID 33 (assigned to )

Route by Car ID 34 (assigned to )

Route by Car ID 35 (assigned to )

5.3 Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

5.4 Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.

5.5 Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why.